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Abstract 

We  propose  two  finite  difference  two-timescale  simultaneous  perturbation  stochastic  approx¬ 
imation  (SPSA)  algorithms  for  simulation  optimization  of  hidden  Markov  models.  Stability 
and  convergence  of  both  the  algorithms  is  proved.  Numerical  experiments  on  a  queueing  model 
with  high  dimensional  parameter  vectors  demonstrate  orders  of  magnitude  faster  convergence 
using  these  algorithms  over  related  ( N  +  l)-Simulation  finite  difference  analogues  and  another 
Two- Simulation  finite  difference  algorithm  that  updates  in  cycles. 
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1  Introduction 


A  popular  approach  to  simulation  optimization  of  discrete  event  systems  with  continuous-valued 
parameters  is  based  on  stochastic  approximation  [1],  [20],  [22],  [27].  Gradient  descent  stochastic 
approximation  algorithms  are  typically  used  to  perform  function  optimization  in  cases  where  the 
function  to  be  optimized  is  difficult  to  compute  analytically  [16].  Several  stochastic  approximation 
schemes  that  have  been  used  for  optimization  of  long  run  average  performance  measures  suffer 
from  the  drawback  that  they  require  data  aggregation  (for  averaging)  over  regeneration  epochs 
[13],  [15].  These  epochs  can  be  very  infrequent  (particularly  for  large  systems/networks),  making 
the  scheme  extremely  slow  in  practice.  In  [12],  an  algorithm  that  updates  the  parameter  after  a 
fixed  number  of  customers  is  presented  and  convergence  proved.  However,  the  algorithms  in  [12], 
[13]  and  [15]  all  require  the  availability  of  direct  gradient  estimates,  and  are  all  based  on  infinitesimal 
perturbation  analysis  [19].  In  [23]  and  [24],  various  stochastic  approximation  algorithms  governed 
by  finite  difference  estimates  (as  well  as  direct  gradient  estimates)  were  considered  for  optimizing 
a  steady-state  performance  measure  with  respect  to  a  scalar  parameter  in  a  single-server  queue. 

In  [3],  [4]  and  [5],  a  more  general  setting  for  vector  parameters  and  long-run  average  perfor¬ 
mance  measures  is  considered,  in  which  the  parameter  is  updated  at  deterministic  instants  that 
are  obtained  using  two  timescales;  a  faster  timescale  at  which  the  system  evolves,  and  a  slower 
timescale  at  which  the  parameter  is  updated.  Specifically,  in  [4],  the  parameter  is  updated  at  de¬ 
terministically  increasing  time  instants  that  are  in  turn  obtained  using  two  timescales  (see  also  [2]). 
On  the  other  hand,  in  [5],  the  parameter  is  updated  at  every  instant  using  coupled  iterations  that 
are  governed  by  different  timescales.  However  as  with  any  other  forward  finite  difference  scheme 
[23],  [24],  these  schemes  also  require  ( N  + 1)  parallel  simulations  for  an  IV-vector  parameter.  Thus, 
if  the  parameter  dimension  ( N )  is  large,  the  corresponding  number  of  simulations  required  to  ob¬ 
tain  the  optimum  parameter  is  large  as  well.  A  proposed  alternative  in  [4]  uses  only  two  parallel 
simulations  at  any  instant  by  moving  the  algorithm  in  bigger  loops  or  cycles.  This,  however,  results 
in  slow  convergence. 

Spall  [28]  proposed  a  stochastic  approximation  technique  that  requires  only  two  simulations  for 
a  parameter  vector  of  any  dimension  and  updates  all  parameter  components  at  every  instant.  This 
technique  came  to  be  known  as  simultaneous  perturbation  stochastic  approximation  (SPSA),  since 
it  simultaneously  perturbs  the  various  parameter  components  randomly,  most  commonly  by  using 
independent  and  identically  distributed  (i.i.d.),  symmetric  Bernoulli  random  variables  in  the  two 
simulations,  and  uses  the  estimates  thus  obtained  for  updating  the  parameter.  It  has  been  applied 
in  various  contexts;  for  instance,  see  [17]  for  an  application  of  SPSA  to  optimization  of  discrete 
event  systems.  Most  of  the  work  in  discrete  event  systems  (except  [17])  is  on  low-dimensional 
problems  (cf.  [4],  [24], [23],  [12], [21]).  In  [28]  and  [22],  a  general  idea  for  high-dimensional  problems 
is  proposed. 

Motivated  by  all  of  the  above  considerations,  in  this  paper  we  develop  SPSA  variants  (that  we 
call  SPSA-1  and  SPSA-2)  of  the  two-timescale  algorithms  of  [4]  and  [5]  (respectively)  for  optimizing 
high-dimensional  parameters  in  hidden  Markov  models.  The  algorithm  SPSA-1  uses  only  two 
parallel  simulations  and  updates  the  parameter  at  increasing  time  instants  as  in  [4] .  The  algorithm 
SPSA-2  also  uses  only  two  parallel  simulations  but  has  the  added  advantage  (over  the  algorithm 
in  [5])  that  it  allows  for  data  aggregation  over  a  fixed  number  of  instants  in  between  two  successive 
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updates  of  the  parameter  for  better  performance.  Moreover,  the  algorithm  in  [5]  was  only  for 
ordinary  Markov  processes  and  not  hidden  Markov  models  (which  is  a  more  general  setting)  [14] 
considered  here.  We  prove  the  convergence  of  both  of  these  schemes  and  numerically  demonstrate 
the  algorithms  on  a  feedback  queueing  network  with  high-dimensional  parameters.  These  schemes 
are  found  to  converge  orders  of  magnitude  faster  than  their  (N  +  l)-Simulation  analogues  in  [4] 
and  [5],  and  also  the  Two-Simulation  algorithm  of  [4]  that  moves  in  cycles. 

Hidden  Markov  models  arise  in  several  queueing  and  stochastic  control  applications.  To  il¬ 
lustrate  a  very  simple  instance  of  a  hidden  Markov  model,  consider  an  M/G/l  queue.  Let  {gn} 
represent  the  queue  length  process  observed  at  customer  arrival  epochs.  Similarly,  let  { rn }  repre¬ 
sent  the  sequence  of  residual  service  times  of  the  customers  in  service  at  these  epochs.  Then  the 
joint  process  {(qn,rn)}  is  Markov.  In  most  real  life  applications  only  the  process  {gn}  is  observed 
whereas  {rn}  is  not.  Thus  in  this  example,  {qn}  represents  a  hidden  Markov  model. 

In  [7],  we  applied  algorithms  SPSA-1  and  a  special  case  of  SPSA-2  for  the  closed  loop  feedback 
control  of  available  bit  rate  (ABR)  service  in  asynchronous  transfer  mode  (ATM)  networks,  by 
considering  parameterized  feedback  policies.  A  finite  state  setting  was  considered  there  and  as  a 
result  there  was  no  problem  with  stability  of  the  schemes.  We  develop  these  algorithms  in  this 
paper  in  the  framework  of  hidden  Markov  models  with  an  unbounded  state  space,  and  therefore 
stability  issues  are  explicitly  addressed.  The  convergence  of  SPSA-1  in  the  finite  state  setting  of  [7] 
was  proven  in  [6].  The  convergence  analysis  of  SPSA-2  in  any  setting  has  not  been  shown  earlier. 

The  rest  of  the  paper  is  organized  as  follows:  In  the  next  section,  we  formulate  the  optimization 
problem,  present  the  assumptions  on  the  system  and  give  a  result  on  tightness  of  the  stationary 
measures  for  the  hidden  Markov  model.  In  Section  3,  we  present  our  SPSA  algorithms  (SPSA-1 
and  SPSA-2)  and  compare  their  performance  with  their  corresponding  analogues  in  [4]  and  [5].  In 
Sections  4  and  5,  we  present  the  detailed  convergence  analyses  for  both  algorithms.  In  Section  6, 
numerical  results  comparing  the  SPSA  algorithms  with  those  of  [4]  and  [5]  for  a  simple  queueing 
system  are  presented.  Finally,  Section  7  provides  the  concluding  remarks. 

2  The  Optimization  Problem 

The  process  that  we  seek  to  optimize  is  an  7 Zd  -  valued  parameterized  (with  parameter  9  €  1ZN) 
hidden  Markov  model  (HMM)  represented  by  {Y (j),j  >  0}  and  is  given  by  the  set  of  coupled 
iterations 

X(j  +  1)  =  F(X(j),  0),  (2.1) 

Y(j  +  l)  =  G(X(j),Y(j),v(j),9),  (2.2) 

j  >  0.  Here  the  state  process  {X(j)}  is  7Zq  -  valued  and  is  unobserved  or  hidden.  Further,  {£(j)|, 
{vU)}  are  i-i-d.  sequences  in  1Zl  and  TZS,  respectively,  and  are  mutually  independent  of  one  another. 
The  maps  F  :  lZq  x  7 Zd  x  7Zl  x  1Z N  — >  7Zq  and  G  :  7Zq  x  lZd  x  TZS  x  1Z N  — >  7Zd  are  measurable.  Also, 
6  —  (9 1, . . . ,  9jy)T  €  1ZN  represents  the  parameter  to  be  tuned  in  order  to  minimize  a  certain  cost 
function  J{6)  (defined  below).  We  will  assume  that  9  takes  values  in  a  compact  set  C  C  7 ZN ,  and 
that  C  is  of  the  form  C  =  We  assume  that  the  joint  process  {(X(j),Y(j))} 

is  ergodic  Markov  for  every  fixed  9,  and  has  stationary  distribution  fig(dx,dy)  G  V(JZq+d).  Let 
ug(dy )  be  the  marginal  of  this  stationary  distribution  in  V(JZd).  Here,  77(...)  represents  the  space 
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of  all  probability  measures  on  Also,  let  pg(x,y;dx' ,dy')  represent  the  transition  kernel  of 

{(X(j),  Y (.]))}■  Let  h  :  1Zd  — >•  1Z  be  a  given  bounded  and  continuous  cost  function.  Our  aim  then 
is  to  find  a  9  in  the  set  C  that  minimizes  the  average  cost 

l 

J{9)  =  lim  —  y  h(Y(j)). 

V  oo  N  4-1 

J=0 

Since  {(X(j),  Y (j))}  is  ergodic  Markov  (for  fixed  9),  the  above  limit  exists  and 

J(9)  =  j  h{y)ve{dy). 

Remark  As  shown  in  [8],  very  general  classes  of  random  processes  have  an  HMM  representation 
if  one  is  permitted  time  nonhomogeneity  viz.,  replace  F  and  G  in  (2.1)-(2.2)  above  with  Fj ,  Gj, 
j  >  0.  Also,  {£(j)}  and  can  be  taken  to  be  uniformly  distributed  on  [0, 1]  without  loss  of 

generality. 

We  now  proceed  with  the  rest  of  the  model.  As  mentioned  earlier,  the  parameter  6  in  (2.1)-(2.2) 
has  to  be  tuned  in  order  to  minimize  J{9).  Let  9n  £  C  represent  the  parameter  value  at  instant  n. 
Thus  in  particular,  we  will  consider  the  following  dynamics: 

X(j  +  1)  =  F(X(j),Y(j),£  CM),  (2-3) 

Y(j  +  l)  =  G(X(j),Y(j),rj(j),Oj),  (2-4) 

j  >  0.  Let  Qn  —  a(X(j),Y(j),  £(j),  rj(j),  9j,  j  <  n ),  n  >  0,  represent  the  natural-filtration 
generated  by  {X(j),  Y(j),  £(j),  r) (j),  9j,j  >  0}.  Note  that  since  { 9 3 }  could  be  any  arbitrarily 
defined  parameter  sequence,  {(X(j),  Y(j))}  defined  simply  by  (2.3)-(2.4)  would  in  general  not  be 
Markov.  We  thus  force  {(X(j),Y(j))}  to  be  Markov  by  assuming  that  any  sequence  {9j  J  >  0} 
satisfies 

P(X(n  +  l)  G  A,Y(n  +  l)  <EB  \  Qn)  =  p§n(X(n),  T(n);  A,  B),  (2.5) 

for  any  A,  B,  Borel  in  lZq  and  lZd  respectively.  We  call  any  such  {9n}  satisfying  (2.5),  an  M- 

Sequence. 

Next,  we  list  the  following  assumptions  on  our  system.  Assumptions  (Al),  (A2)  and  (A3) 
are  needed  to  prove  convergence  of  our  first  algorithm  SPSA-1,  while  (Al),  (A2')  and  (A3)  are 
required  for  convergence  of  algorithm  SPSA-2.  Both  algorithms  SPSA-1  and  SPSA-2  are  presented 
in  Section  3. 

Assumptions 

(Al)  The  average  cost  J{9)  is  continuously  differentiable. 

(A2)  The  map  ( 9,x,y )  £  7 ^N+g+d  — »  pe(x,  y- dx' ,  dy')  £  V(lZq+d)  is  continuous. 

(A2;)  For  any  h  £  C(TZq+d)  vanishing  at  00, 

lim  pg(x,y;dx,,dy')h(x,,y')  =  0 
W(x,y)\\^Hx>  J 

uniformly  over  9  £  C. 

(A3)  Liapunov  Stability  Condition:  There  exist  nonnegative  V  £  C(7Zq+d),  K  C  1Zq+d  compact 
and  6q  >  0  such  that  under  any  M-Sequence  {9n}, 
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!•  ^\\(x,y)\\^ooV(x,y)  =  oo, 

2.  sup nE[V(X(n),Y(n))2}  <  oo, 

3.  E[V(X(n  +  1  ),Y(n  +  1))  |  Qn\  <  V{X(n ),  F(n))  -  e0, 
whenever  (X(n),Y(n))  ^  K,  n  >  0. 

Note  that  a  sufficient  condition  for  (Al)  to  be  satisfied  is  that  the  parameterized  stationary  dis¬ 
tribution  fig(dx,dy )  of  the  ergodic  Markov  process  {(X(n),  Y(n))}  be  continuously  differentiable 
in  the  parameter  8.  In  [29],  certain  sufficiency  conditions  for  showing  the  latter  have  been  given. 
Assumptions  (A2)  and  (A2/)  are  technical  conditions  that  are  satisfied  routinely  by  most  systems 
[4],  [5].  Finally,  Assumption  (A3)  is  required  to  ensure  that  the  system  remains  stable,  and  is  a 
standard  assumption.  In  [3],  the  algorithms  of  [4]  and  [5]  were  applied  to  a  closed  loop  feedback 
control  problem  in  asynchronous  transfer  mode  (ATM)  networks  and  all  the  above  assumptions 
were  verified  for  the  system  (also  see  [6]  for  verification  of  Assumptions  (Al),  (A2)  and  (A3)  for  a 
similar  system). 

Recall  that  a  sequence  {fin}  of  probability  measures  on  S  is  tight  if  for  each  e  >  0,  there  exists 
a  compact  set  Ke  C  S  such  that  jln(Ke)  >  1  —  e  for  all  n.  The  following  lemma  shows  that  the  sets 
of  all  parameterized  stationary  distributions  {/U g,8  £  C}  and  their  marginals  {ug,  8  £  C}  are  tight 
and  crucially  uses  Assumption  (A3)  for  its  proof.  This  result  is  required  later  in  the  analysis. 

Lemma  2.1  {/ ug,9  £  C }  (resp.,  {isg,8  £  C})  is  compact  in  V(TZq+d)  (resp.,  V(TZd))  and  the 
map  8  — »  no  (resp.,  vg)  is  continuous. 

Proof  Follows  in  exactly  the  same  manner  as  Lemma  2.1  of  [4].  □ 

3  Two-Timescale  SPSA  Algorithms 

In  this  section,  we  shall  present  our  two-timescale  SPSA  algorithms  and  compare  their  performance 
with  corresponding  two-timescale  finite  difference  algorithms  in  [4]  and  [5].  In  order  to  put  things 
in  proper  perspective  and  to  clearly  bring  out  the  advantages  of  our  SPSA  algorithms,  we  shall 
first  begin  with  the  (N  +  l)-Simulation  finite  difference  stochastic  approximation  algorithm  of  [4] 
that  we  refer  to  as  (N  +  l)-Simulation  FDSA-1  and  its  corresponding  Two-Simulation  alternative 
(proposed  in  that  paper)  referred  here  as  Two-Simulation  FDSA-1.  We  shall  then  present  our 
first  SPSA  algorithm  (SPSA-1).  Later,  we  shall  illustrate  the  (N  +  l)-Simulation  finite  difference 
algorithm  of  [5]  that  we  refer  to  as  (N  +  l)-Simulation  FDSA-2,  followed  by  its  generalized  SPSA 
version  (SPSA-2).  We  shall  then  briefly  compare  the  performance  of  all  these  algorithms  and  argue 
the  reasons  for  the  superior  performance  of  SPSA-1  and  SPSA-2  over  the  algorithms  in  [4]  and  [5]. 

3.1  The  Algorithms 

The  algorithms  presented  here  are  called  two-timescale  algorithms  since  they  are  governed  with 
two  step-size  sequences  (or  timescales)  {a(ra)}  and  {b(n)}  defined  below.  Let  6  >  0  be  a  fixed 
small  constant.  Let  7r,;(x')  =  min(max(  a:).  8j:  max),  i  =  1,...,  N,  denote  the  point  closest 

to  x  £  1Z  in  the  interval  [8j  min .  8im^\  C  1Z,  i  =  1  and  ir(8)  be  defined  by  the  vector 

ir(8)  =  (7Ti(#i), . . . ,  ttn(Qn))T-  Then  n(8)  is  a  projection  of  8  £  1ZN  onto  the  set  C.  Define 
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sequences  {a(n)}  and  {6(b)}  as  follows:  a(0)  =  6(0)  =  1,  a(i )  =  i  1,  b(i )  =  i  “,  i  >  1,  and  with 
-  <  a  <  1.  Then  clearly, 

a(n  +  1)  6(n  +  1) 

t  1,  as  n  — >■  oo,  (3-1) 

a(n)  b(n) 

a(n)  =  ^  6(n)  =  oo,  ^  a(n)2,  ^  6(n)2  <  oo,  a(n)  =  o(6(n)).  (3-2) 


Dehne  >  0}  as  follows:  no  =  1  and  nm+i  =  min{j  >  nm  |  Yli=n,n+ 1®(®)  >  6(nr)},  ui  >  1. 

Then  {nm}  represents  a  deterministically  increasing  sequence  of  points.  In  SPSA-1  (as  also  (IV + 1)- 
Simulation  FDSA-1  and  Two-Simulation  FDSA-1),  {nm}  defines  the  parameter  update  instants  of 
the  algorithm. 

Note  that  any  finite  difference  stochastic  approximation  scheme  ordinarily  requires  (N  + 1)  par¬ 
allel  simulations  for  an  IV-vector  parameter.  The  two-timescale  stochastic  approximation  algorithm 
in  [4]  for  an  IV-vector  parameter  (which  we  call  (IV  +  l)-Simulation  FDSA-1)  is  thus  as  follows: 

(IV  +  l)-Simulation  FDSA-1 

The  first  simulation  corresponds  to  {(X(j),Y  (j))}  and  is  governed  by  {9j}  that  is  in  turn 
defined  by  9j  =  9(m),  for  nm  <  j  <  nm+ 1 .  The  remaining  N  parallel  simulations  are  represented 
by  {(Xi(j),Yi(j))},  i  =  1,...,IV,  and  are  respectively  governed  by  {0l(j)},  i  =  1, . . . ,  IV,  with 
8l(j)  =  8(m )  +  8ei,  for  nm  <  j  <  nm+ 1,  and  where  e*  is  the  unit  vector  with  1  in  the  ?'th  direction. 
The  algorithm  then  is  as  follows:  For  i  =  1, . . . ,  N, 


6i(m  +  1)  =  7 Ti 


au)  ^ - g - J 


j=nm+ 1 


(3.3) 


It  is  clear  that  one  would  require  N  +  1  parallel  simulations  using  this  algorithm.  An  alternative 
scheme  was  proposed  in  [4]  that  uses  only  two  parallel  simulations  at  any  instant.  This,  it  achieves 
by  moving  the  algorithm  in  ‘cycles’,  in  each  of  which  only  one  component  is  updated.  This  scheme 
that  we  call  Two-Simulation  FDSA-1  is  as  follows: 


Two- Simulation  FDSA-1 

The  first  simulation  here  corresponds  to  {(X(j),  Y (j))}  and  is  governed  by  {0j}  where  the 
parameter  9j  is  the  IV-vector  6j  =  . . . ,  9j,n)T  with  8hi  =  9i(m),  for  n^m+i-i  <  j  <  nNm+i, 

i  =  1, . . . ,  IV,  m  >  0  (with  N  being  the  dimensionality  of  the  parameter  vector),  where  9jj  (resp. 
9i(m))  is  the  ith  component  of  9j  (resp.  9(m)).  The  second  simulation  is  now  represented  as 
{(X(j),Y(j))}  and  is  governed  by  {8j}  defined  by  8j  =  8{m)  +  for  nNm+i-i  <  j  <  nNm+i, 
i  =  1, . . . ,  N,  m  >  0.  The  algorithm  is  as  follows:  For  i  =  1, . . . ,  IV, 


9i(m  +  1)  =  7 Ti 


nNm-\-i 

0i(m)  +  aU) 


(h(Y(j))-h(V(j))'j>j 


(3.4) 


Thus  using  this  scheme,  the  whole  parameter  is  updated  once  every  njym  steps  instead  of  the  nm 
steps  required  for  one  update  using  the  (N  +  l)-Simulation  FDSA-1  version  (3.3)  of  it.  Next,  we 
present  our  first  randomized  difference  SPSA  algorithm  (SPSA-1). 

Let  for  any  i  >  0,  A  (I)  €  HN  be  a  vector  of  mutually  independent  and  mean  zero  random 
variables  {Ap*,  . . .  A jv,i},  (viz.,  A (i)  =  (Ai^, . . . ,  A n,i)T)  taking  values  in  a  compact  set  E  C  TZN 
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and  having  a  common  distribution.  We  assume  that  these  random  variables  satisfy  Condition  (B) 
below. 


E 


Condition  (B)  There  exists  a  constant  K  <  oo,  such  that  for  any  l  >  0,  and  i  €  {1, . . .  ,  N }, 


ALi 


<  K. 


Further,  we  assume  that  {A(i)}  is  a  mutually  independent  sequence  with  A (i)  independent  of 
cr(8(l),l  <  i),  the  filtration  generated  by  the  sequence  of  parameter  updates.  Condition  (B)  is  a 
standard  condition  in  SPSA  algorithms.  Minor  variants  of  this  condition  are  for  instance  available 
in  [28],  [11].  Note  that  distributions  like  Gaussian  and  Uniform  are  precluded  while  using  Condition 
(B).  An  important  consequence  of  E  Az“2  <  oo  is  that  P  (A ^  =  0)  =  0. 

We  now  proceed  with  our  first  algorithm,  SPSA-1,  wherein  we  use  only  two  parallel  simulations 
and  update  all  parameter  components  every  nm  instants  by  perturbing  all  of  these  simultaneously 
along  random  directions  in  the  two  simulations. 


SPSA-1 


Consider  two  parallel  simulations  {(Xk(j),  Yk(j))},  k  =  1,2,  respectively  governed  by  {9k}, 
k  =  1,2  as  follows:  For  the  process  {(X1^'),  y1^'))},  we  define  9j  =  9{m)  —  5A(m),  for  nm  < 
j  <  nm_ |_i,  m  >  0.  The  parameter  sequence  {92}  for  {(X2(j),  Y2(j))}  is  similarly  defined  by 

92  =  9{m)  +  5A(m),  for  nm  <  j  <  nm+ 1,  m  >  0.  In  the  above,  9{m)  =  (9\(m), . . . ,  9N(m))T 
is  the  value  of  the  parameter  update  that  is  governed  by  the  following  recursion  equations.  For 
z  =  1, ...  ,1V, 


9i(m  +  1)  =  7 n 


8i(m)  + 


nm-\- 1 

E 

j=nm+ 1 


a{j) 


'h{Yl)-h{Y2)' 


25A 


m,i 


(3.5) 


m  >  0.  It  will  be  shown  in  the  proof  of  Theorem  4.1  that  the  sequence  {nm}  is  an  exponentially 
increasing  sequence.  Thus  algorithms  ( N  +  l)-Simulation  FDSA-1,  Two-Simulation  FDSA-1  and 
SPSA-1  update  parameters  at  exponentially  increasing  time  instants.  Hence,  using  these  algo¬ 
rithms,  subsequent  parameter  updates  become  less  frequent  as  time  progresses.  The  algorithm  of 
[5]  (cf.  ( N  +  l)-Simulation  FDSA-2  below)  on  the  other  hand,  uses  coupled  iterations  with  two- 
timescales  and  updates  the  whole  parameter  at  every  instant  (even  though  it  requires  N  + 1  parallel 
simulations  at  any  instant).  We  present  this  algorithm  next. 


(N  +  l)-Simulation  FDSA-2 

Let  {(X(n),  Y (n))}  be  governed  by  {9(n)}  (where  9{n )  =  (#i(n), . . .  9n(ji))t  is  the  nth  update 
of  parameter  9)  and  which  is  updated  according  to  equations  (3.6)  below.  Let  us  also  define  N 
additional  parallel  simulations  as  follows:  For  i  =  l,...,iV,  let  {(X*(n),  Yl(n))}  be  governed  by 
{9(n)  +  <5ej},  where  e*  is  the  unit  vector  with  1  in  the  ith  direction.  In  the  following,  the  sequences 
{Z(n)}  and  {Zl(n)},  i  =  1, .. . ,  N,  perform  weighted  averages  of  the  cost  function  values  and  are 
defined  as  in  the  last  two  equations  in  (3.6)  below.  Let  Z{ 0)  =  Zl{ 0)  =  0,  i  =  1, . . . ,  N.  Then,  for 
i  =  l,...,N, 

9i(n  +  1)  =  7 Ti  ^ 8i(n )  +  a(n) 

Z(n  +  1)  =  Z(n)  +  b(n)(h(Y(n))  -  Z[n)) 

Z\n  +  1)  =  Z\n)  +  b{n)(h{Y\n))  -  Z\n ))  (3.6) 


Z(n)  -  Zi(n) 
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It  is  clear  that  one  requires  N  +  1  parallel  simulations  in  this  manner.  Finally,  we  present  our  next 
SPSA  algorithm  (SPSA-2)  which  requires  only  two  parallel  simulations  as  in  SPSA-1  but  which  also 
allows  for  data  aggregation  over  a  fixed  number  L  of  epochs  in  between  two  successive  parameter 
updates  for  better  performance.  We  will  explain  the  last  part  in  detail  later. 

SPSA-2 

Let  {(X~  (l),Y~  (l))}  and  {(X+(7),  T+(f))}  be  the  two  parallel  simulations.  These  depend  on 
parameter  sequences  {0(n)  —  5A(n)}  and  {0(n)  +  8A(n)}  respectively  in  the  manner  explained 
below:  Let  L  >  1  be  a  given  fixed  integer.  We  extract  double  sequences  {(X“(n),  Y~(n))}  and 
{(X+(n),  T)+(n))},  n  >  0,  m  =  0, 1, . . . ,  L  —  1,  from  the  two  parallel  simulations  in  the  following 
manner.  Write  l  as  l  =  nL  +  m,  where  n  >  0  and  m  €  {0, 1, . . . ,  L  —  1}.  Now,  set  X~(n)  = 
X~ (nL  +  m)  and  Y~(n)  =  Y~ (nL  +  m).  Similarly,  X+(n)  =  X+(nL  +  m)  and  Y+(n)  =  Y+(nL  + 
m)  respectively.  Now,  for  m  =  0, —  1,  (X~(n),Y~(n))  is  governed  by  the  parameter 
6(n )  —  8A(n).  Similarly,  for  m  =  0, 1, . . . ,  L  —  1,  (X+(n),  Y^(n))  is  governed  by  the  parameter 

8(n)  +  8A(n).  We  also  define  two  double  sequences  {Z~(n)}  and  {Z+(n)},  n  >  0,  m  =  0, 1, . . . , L  — 

1,  in  recursions  (3.7)  for  averaging  the  cost  function.  Let  Z^( 0)  =  Z-f( 0)  =  . . .  =  Z£( 0)  =  0  and 
Zq  (0)  =  Z ^  (0)  =  . . .  =  Z ^  (0)  =  0.  Then,  for  i  =  1, . . . ,  N,  n  >  0, 

6i(n  +  1)  =  7 Ti  ^6»j(n)  +  a(n) 

where,  for  m  =  0, 1, . . . ,  L  —  1, 

zrn+ i(n  +  !)  =  zrn(n  +  1)  +  b{n)(h(Y-(n  +  1))  -  Z~(n  +  1)), 

zrn+ i(n  +  !)  =  zrn(n  +  !)  +  b(n)(h{Y+  (n  +  1))  -  Z+(n  +  1)),  (3.7) 

with  Z£ (n  +  1)  =  Z£ (n)  and  Zq  (n  +  1)  =  Z £ ( n ).  Note  again  that  one  requires  only  two  parallel 
simulations  in  this  manner  as  opposed  to  N  +  1  earlier. 

Remark  Note  that  for  L  =  1,  the  algorithm  SPSA-2  is  simply  as  follows:  Let  {(X~ (n),Y~ (n))} 
and  {(X+(n),  Y+(n))}  be  the  two  parallel  simulations  respectively  governed  by  {0{ri)  —  e>A(n)} 
and  {6(n)  +  <5A(n)}.  Then,  for  i  =  1, . . . ,  N, 

9i(n  +  1)  =  7 u  foi(n)  +  a(n) 

Z~(n  +  1)  =  Z~(n)  +  b(n)(h{Y-(n))  -Z~(n)), 

Z+(n  +  1)  =  Z+{n)  +  b(n)(h{Y+(n))  -  Z+(n)),  (3.8) 

with  Z~{ 0)  =  Z+( 0)  =  0.  We  observed  in  the  numerical  experiments  that  the  algorithm  (3.8) 
(corresponding  to  L  =  1)  did  not  exhibit  good  performance  when  the  parameter  dimension  is  high. 
This  could  probably  be  due  to  the  fact  that  in  the  latter  case,  the  system  does  not  adapt  as  quickly 
to  the  new  parameter  update  before  it  changes  again.  By  selecting  L  >  1,  one  can  effectively  take 
care  of  this  problem  by  holding  the  parameter  fixed  for  L  instants,  thus  giving  the  system  sufficient 
time  to  adapt  to  the  new  parameter  update.  The  choice  of  L  is  completely  arbitrary  though.  In 
the  numerical  experiments  for  instance,  where  we  consider  the  parameter  vectors  to  be  10  and 
40-dimensional  respectively,  the  value  of  L  is  chosen  as  100. 


Z-{n)-Z+(n) 


28A 


n,i 
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3.2  Comparison  of  Algorithms 

We  can  classify  the  five  algorithms  broadly  into  two  categories  -  those  that  update  the  parameter 
over  time  instants  (or  their  multiples)  of  increasing  separation  nm,  m  >  1,  and  those  that  update 
the  parameter  at  regular  intervals.  Let  us  first  consider  the  algorithms  in  the  first  category.  These 
comprise  ( N  +  l)-Simulation  FDSA-1,  Two-Simulation  FDSA-1  and  SPSA-1.  As  already  stated 
earlier,  ( N  +  l)-Simulation  FDSA-1  requires  N  + 1  parallel  simulations  at  every  instant  but  updates 
the  whole  parameter  vector  once  every  nm  instants.  On  the  other  hand,  Two-Simulation  FDSA-1 
uses  only  two  simulations  and  updates  the  parameter  in  cycles  of  njvm,  m  >  1  instants,  where  N 
is  the  parameter  dimension.  SPSA-1,  however,  updates  the  full  parameter  every  nm  instants  and 
still  requires  only  two  parallel  simulations  for  doing  so.  Thus  SPSA-1  has  the  combined  advantages 
of  ( N  +  l)-Simulation  FDSA-1  and  Two-Simulation  FDSA-1.  Moreover  SPSA-1  tracks  trajectories 
of  the  ordinary  differential  equation  (o.d.e.  )  (4.7)  as  does  ( N  +  l)-Simulation  FDSA-1.  It  was 
shown  in  [4]  that  Two-Simulation  FDSA-1  tracks  trajectories  of  an  o.d.e.  that  is  similar  to  (4.7) 
but  with  a  factor  1/A  multiplying  its  RHS.  This  factor  essentially  serves  to  slow  down  the  rate  of 
convergence  of  the  algorithm  (3.4). 

Our  next  set  of  algorithms  viz.,  (A+l)-Simulation  FDSA-2  and  SPSA-2  update  parameters  after 
every  fixed  number  of  instants.  In  particular,  (A+l)-Simulation  FDSA-2  updates  the  full  parameter 
vector  every  instant  while  requiring  N  +  1  parallel  simulations  for  the  same.  SPSA-2,  on  the  other 
hand,  requires  only  two  parallel  simulations  and  updates  the  parameter  after  every  fixed  number 
‘L’  of  instants.  This  number  (L  >  1)  is  chosen  arbitrarily.  SPSA-2  thus  allows  for  data  aggregation 
in  between  successive  parameter  update  epochs  (for  better  performance)  while  requiring  only  two 
parallel  simulations  at  any  instant.  It  will  be  shown  in  Section  5  that  SPSA-2  tracks  the  trajectories 
of  the  o.d.e.  (4.7)  on  the  slower  timescale.  A  similar  result  was  shown  in  [5]  for  ( N  +  l)-Simulation 
FDSA-2.  In  Section  6,  we  shall  consider  a  simple  queueing  network  with  parameter  vectors  of 
dimensions  10  and  40  respectively.  We  found  that  both  SPSA-1  and  SPSA-2  exhibit  significantly 
superior  performance  than  the  algorithms  in  [4]  and  [5]  mentioned  here.  We  shall  consider  in 
Section  6  the  following  step-size  sequences  (a(ra)}  and  (6(n)}:  a(0)  =  6(0)  =  1,  a(n)  =  1/n  and 
6(n)  =  1/n2/3,  n  >  1.  For  these  sequences,  no  =  1,  n5o  ~  2.5  x  104,  nioo  ~  4.3  x  105  etc.  Thus,  a 
possible  disadvantage  in  using  SPSA-1  is  for  large  systems/networks  that  require  several  updates 
of  the  parameter  before  convergence  is  achieved,  since  in  using  this  scheme,  successive  parameter 
updates  are  held  fixed  over  intervals  of  increasing  sizes  and  thus  the  parameter  is  updated  less  often 
as  time  progresses.  This  is  not  the  case  with  algorithm  SPSA-2  where  we  hold  the  parameter  fixed 
only  for  a  fixed  number  L  of  epochs  before  updating  it.  This  intuition  is  also  confirmed  in  Section 
6  where  we  show  numerical  experiments  with  parameters  of  dimensions  10  and  40  respectively.  We 
observed  that  when  the  parameter  dimension  is  10,  SPSA-1  outperforms  SPSA-2.  However,  when 
the  same  is  increased  to  40,  it  is  SPSA-2  that  performs  better  than  SPSA-1. 

As  already  observed,  the  algorithms  SPSA-1  and  SPSA-2  are  computationally  superior  than 
their  corresponding  variants.  This  is  however  achieved  by  generating  N  i.i.d.  random  variables 
Anji, . . . ,  AnjAr,  that  satisfy  Condition  (B).  In  particular,  one  could  select  these  to  be  i.i.d.,  Bernoulli 
distributed  (as  we  do  in  our  numerical  experiments  in  Section  6)  viz.,  A =  ±1,  w.p.  1/2, 
i  =  1, ...  ,N.  It  will  become  clear  in  the  convergence  analysis  in  the  next  two  sections  that 
it  is  these  randomizations  and  the  particular  form  of  the  gradient  estimates  that  are  primarily 
responsible  for  both  of  these  schemes  using  only  two  parallel  simulations  at  any  instant  as  against 
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N  +  1.  The  presence  of  A in  the  denominator  of  the  gradient  estimate  term  on  the  RHS  in 
the  update  equation  for  the  ith  component  0*(n)  of  6{n)  in  (3.5),  and  the  first  equation  in  (3.7), 
essentially  ensures  that  the  update  for  the  ?'th  component  occurs  only  in  the  ith  gradient  direction 
(with  the  rest  of  them  averaging  to  zero).  Generating  N  i.i.d. ,  Bernoulli  random  variables  or  in 
general  those  satisfying  Condition  (B),  is  far  more  computationally  simpler  than  generating  N 
parallel  simulations,  the  latter  requires  in  particular  simulating  N  independent  parallel  systems. 
It  will  be  shown  in  the  next  two  sections  that  algorithms  (3.5)  and  (3.7)  asymptotically  track  the 
stable  points  of  the  o.d.e.  (4.7). 

Finally,  in  [6],  the  algorithm  SPSA-1  was  analyzed  in  the  context  of  rate  based  feedback  flow 
control  in  available  bit  rate  (ABR)  service  in  asynchronous  transfer  mode  (ATM)  networks.  How¬ 
ever,  because  of  the  finite  state  setting  there,  questions  about  stability  of  the  scheme  did  not  arise. 
This  is  however  not  the  case  here.  Our  state  space  is  unbounded  and  hence  we  require  Liapunov 
stability  assumptions  on  the  system  to  ensure  tightness. 


4  Convergence  Analysis  of  SPSA-1 


By  a  Polish  space,  we  mean  a  complete  separable  metric  space.  Let  S'  be  a  Polish  space  with 
complete  metric  d.  Also,  let  V(S)  be  the  space  of  probability  measures  on  S.  Let  C&(S )  be  the 
space  of  all  bounded  and  continuous  functions  on  S.  A  countable  sequence  of  functions  {/*}  C  Cj,(S) 
is  called  a  separating  class  for  V(S)  if  for  any  probability  measures  p,  v  G  V(S),  j  fadp  =  j  fidu, 
i  >  1,  implies  p  =  v.  It  is  easy  to  show  (cf.  Ch.2  of  [9])  that  such  a  separating  class  of  functions 
exists  for  any  such  V(S).  Also,  a  class  of  functions  {fa,a  G  1}  C  Cb(S)  is  called  a  convergence 
determining  class  for  V(S)  if  for  any  sequence  of  measures  {/rn}^=1  G  V(S),  J  fadpn  — »  J  fadp oo 
Va  G  /,  implies  pn  — >  p^  in  V(S). 

Now  define  a  metric  p  on  V(S)  as  follows:  For  any  p,  v  G  V(S), 


p(p,  v)  =  2  n  I  fndV  ~  [  fndv 

71=1  J  J 


where  {fn}  is  a  separating  class  of  functions  for  V{S).  The  topology  induced  by  p  is  called  the 
Prohorov  topology.  It  can  be  shown  that  under  p,  the  space  V(S)  is  complete  if  and  only  if  S  is 
compact  and  not  otherwise.  Thus  p  is  not  complete  if  S  is  not  compact.  Now  consider  the  following 
metric  p  defined  in  the  following  manner:  For  e  >  0  and  Borel  A  C  S,  let  Ae  =  {x  G  S  \  d(x,  A)  <  e}. 
For  p,  v  G  V(S),  define 


p(p,  v)  =  inf{e  >  0  |  p(A)  <  u(Ae)  +  e,  v{A)  <  p(Ae)  +  e,  for  all  Borel  A  C  5}. 


The  metric  p  is  called  the  Prohorov  metric.  It  retains  the  same  topology  as  the  Prohorov  topology 
and  is  in  addition  complete.  Moreover,  V(S)  is  separable  under  the  Prohorov  topology  thereby 
making  the  space  V(S)  Polish  without  the  requirement  that  S  be  compact  (in  addition  to  being 
Polish).  It  can  also  be  shown  that  if  S'  is  a  compact  Polish  space,  then  so  is  V(S).  In  what  follows, 
we  shall  assume  that  V(S)  is  equipped  with  the  Prohorov  topology  and  is  thus  Polish. 

Consider  now  the  stochastic  approximation  scheme  SPSA-1.  Recall  that  {(Xk(j),Yk(j))}:  k  = 
1, 2,  are  the  two  parallel  simulations  respectively  governed  by  {0k },  k  =  1,2,  with  dj  =  9(m)—5A(m) 
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and  62  =  6{m)  +  5A(m)  respectively,  for  nm  <  j  <  nm+ 1,  m  >  0.  Also,  the  dynamics  of  the 
simulations  {(Xk(j),  Yk(j))},  k  =  1,2,  is  governed  by  equations  of  type  (2.3)-(2.4),  with  £(j),  rj(j) 
replaced  with  analogously  defined  £fc(j),  rjk(j),  k  =  1,2,  respectively  independent  of  one  another. 

In  this  section,  we  assume  (Al),  (A2)  and  (A3)  (defined  in  Section  2)  for  our  system.  Let  the 
filtration  be  represented  by  Tn  =  a  {Xk(j),Yk(j),£k(j),rik(j),  9j,Aj,  k  =  1,  2;  j  <  n),  where 
6j  =  6{m)  and  A  j  =  A  (m)  for  nm  <  j  <  nm+ 1,  m  >  0.  For  m  >  0,  dehne  V(lZq+d )  -  valued 
random  variables  {/U^},  k  =  1,2,  for  /  €  C'f,(7?.<J+d)  (the  space  of  bounded  and  continuous  functions 
on  ^+d)  by 

E”r„+:+ia«/(xfe(i),yfe(i)) 


fd^m.  = 


^-m+1 


nm+l 


a(i) 


m  >  0,  k  =  1, 2.  Let  =  TU3-1^  U  {oo}  denote  the  one  point  compactihcation  of  TU3-1^  and  let 

5oo  denote  the  Dirac  measure  at  {oo}.  In  other  words, 


<$00(3;)  = 


1  if  x  =  {00} 
0  otherwise 


(4.1) 


Using  the  natural  embedding  of  7 Zq+d  into  lZq+d ,  one  may  view  {/z^,  m  >  0},  k  =  1,2,  as  random 
sequences  in  V(1 Zq+d),  a  compact  Polish  space.  The  proof  of  Theorem  4.1  (below)  differs  at  several 
places  from  that  of  a  similar  result  in  [6]  since  as  already  stated  earlier,  the  latter  was  shown  for 
finite  state  spaces  which  is  not  the  case  here.  Moreover,  Theorem  4.1  here,  also  serves  to  show 
stability  of  the  proposed  scheme,  instead  of  just  being  a  step  in  the  convergence  analysis  as  in  [6]. 

Theorem  4.1  Almost  surely,  (/x^,  /q2n,  8(m),  A(m)),  m  >  0,  converges  in  V(lZk+d)2  x  C  x  E 
to  the  compact  set  {(^e-5A,  k-s+5A,  8,  A)  |  9  £  C,  A  £  E}. 

Proof  The  proof  proceeds  through  several  steps.  For  /  £  Cj,( 7Zq+d),  define  sequences 
{Zk(m),m  >  1},  k  =  1,2,  by 


m—  1 

zk(m)  = 

3= 0 


ni+i 

E 

i=n-j- 1-1 


/ 


(4.2) 


Then  {Zk(m),  Fnm},  k  =  1,2,  are  zero  mean,  square  integrable  martingale  sequences.  Let  us 
represent  their  quadratic  variation  processes  by  {<  Zk  >  (m)},  k  =  1,2,  respectively.  Then  by 
definition, 

m—1 

<  Zk  >  (m)  =  Y  E 
j= 0 

with  <  Zk  >  (00)  =  lim  <  Zk  >  (m).  For  ease  of  exposition,  let  us  denote  Fk(i)  A  f(Xk(i),Yk(i)), 

771 — y  00 

k  =  1,2,  in  this  part  of  the  proof.  Note  that 


(Zk(j  +  1)  -  Zk(j))2  |  Fn 


+  E 


Zk{  0)2 


(4.3) 


E 


(■ Zk(j  +  1)  -  Zk(j))2  |  Fn 


=  E 


nj+ 1 


m 


Y  a(i)(Fk(i)-E[Fk(i)  |  J^-i]) 


Fn 


<i=rij- 1-1 


=  E 


ni+ 1 


6(J)2  ,-A 


E  a(z)2  (Ffc(i)  -  E  [Fk{i)  |  ^_!])2  | 


i=rij- 1-1 


11 


+E 


Uj+ 1 


b(j)2 

'  i,l=rij- 1-1, 


E  a(i)a(l)  (Fk(i)  -  E  [Fk(i)  \  Ei-i])  ( Fk(l )  -  E  [ Fk{l )  \  E^})  \  En 


The  second  term  on  the  RHS  above  can  be  written  as 


nj+1 


E 


+E 


m 


i 


E  E  [«(*)°(0  (**(*)  -  E  [**(»)  I  (Fk(l)  -  E  [. Fk(l )  I  El.!})  I  Ji_i]  I  En 


i,l=rij-\- 1,  i>l 


n3+ 1 


53  E  [ o(i)o(Z )  (Ffc(i)  -  E  [Fk(i)  |  ^_i])  (Ffc(0  -  E  [. Fk{l )  \  E^})  \  E^}  \  Enj 

i,l=nj+ 1,  i<Z 


=  0  a.s., 

since  for  i  >  l,  Ei  C  .F)_i  and  similarly  Ei  C  ^j_i  for  i  <  l.  Thus 

„o\2 


1 


nj+i 


b(j)2 


E 


(zfc(j  + 1)  -  zfe(j)r  i 

(. f(Xk(i),Yk(i ))  -  E  [/(Xfc(i),yfc«)])2  I 


(4.4) 


Now  since  /  is  a  bounded  and  continuous  function,  there  exist  constants  Ki,K[  >  0  such  that 

<  *  >  <"*>  *  *  g  b(]f  X  ^  <-  *  g  ^  fc  ■ -  d ■ 


1 


the  latter  inequality  follows  from  the  fact  that  Ea(^)2  ~  1 - •  Also  note  that  53  aW  ~  ln(n), 


i=0 


n 

^m+1 


i= 0 


and  E  b(i)  ~  n1  Now,  from  the  definition  of  {?im,  m  >  0},  E  o(*)  ~  E&«-  Thus)  ln(nm+i) 
2—0  2—0  2—0 
ps  m}~a.  Hence,  nm+i  ps  exp(am1^")  for  some  a  >  0.  Thus,  <  Zk  >  (oo)  <  oo.  Hence,  by 

Proposition  V11.2.3(c)  of  [26],  {Zk(m),m  >  1},  k  =  1,2,  are  a.s.  convergent  martingale  sequences. 
Now  let  us  consider  {Zi(m)}.  From  the  fact  that  E=E+i  a(j)/b(m)  —$■  1,  as  m  — >  oo  and  from 
(3.1), 

f(x,  y)-j  fix',  y')pe(m)-8A(rn)  (®,  V,  dx' ,  dy')  \  ulnidx,  dy)  0  a.s.  (4.5) 


Now,  since  the  above  holds  for  all  /  €  Cb{Eq+d),  it  holds  in  particular  for  those  f  €  Cb(Tlq+d)  that 
are  in  a  countable  convergence  determining  class  of  V(lZq+d).  Hence,  outside  a  set  of  measure  zero, 
any  limit  point  of  (p}n,9(m)  —  SA(m)),  m  >  0,  in  V(JZq+d)  x  C  must  be  of  the  form  ( bSoo  +  (1  — 
b)p,9  —  S A).  In  the  above,  b  €  [0, 1],  where  when  b  <  1,  p  must  satisfy 


f(x,y)p(dx,dy)=J  j  f(x',y')po_SA(x,y;dx',dy')y(dx,dy). 


Thus  for  6  <  1,  //  must  be  of  the  form  pesA-  For  b  =  1,  p  is  arbitrary  and  hence  can  be  set 
to  be  po-6A  itself.  Now,  note  that  if  in  the  definition  of  the  sequences  {Zk(m)},  k  =  1,2,  the 
function  /(.,.)  is  replaced  by  the  Liapunov  function  V(.,.),  the  sequences  {Zk(m)}  continue  to 
be  martingale  sequences.  Also,  from  Assumption  (A3. 2)  and  (4.4),  it  is  clear  that  the  quadratic 
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variation  process  of  such  a  martingale  would  converge  as  well.  We  then  obtain  (4.5)  with  function 
V  replacing  /.  Now  define 

<l>0-8A(x,y)  =  I  V{x' ,y')p6-s&(x,y\dx' ,dy')  -  V(x,y), 

with  00-(5a({oo})  =  —eo,  VO  G  C.  Here  eo  is  the  same  as  in  Assumption  (A3. 3).  Now,  as  a  conse¬ 
quence  of  Assumption  (A3. 3),  the  map  ( 0 ,  x,  y )  — >  <j>e(x,  y)  :  C  x  7 Zq+d  — >  1Z  is  upper  semicontinuous 
and  bounded  from  above.  Further,  if  (/r^, Q(m )  —  8A(m))  — »  (b8 oo  +  (1  —  b)fig_sA,0  —  5 A)  along 
a  subsequence,  then  from  (4.5)  with  /  replaced  by  V.  we  have, 

0  =  limsup  /  <l>e(m)-5A(m)df4n  <(!-&)  /  <t>e-6Ady.e-SA  ~  be 0,  (4.6) 

m— >-oo  J  J 

along  the  same  subsequence.  Now,  it  is  clear  from  the  definitions  of  4>e~5A  and  ygsA  above  that 


J  4>e-8Adye-5A  =  0. 


Thus  from  (4.6),  we  have  0  <  —be o,  which  cannot  happen  unless  b  =  0.  A  similar  argument  holds 
for  the  sequence  {/q2n}.  Thus  {/r^},  k  =  1,2,  are  tight  in  V{lZq+d)  and  have  limit  points  of  the 
form  Mo_<5A  or  M0+5A-  Now  by  Lemma  2.1,  the  maps  6  — »  and  6*  /Ug+5A,  k  =  1,2,  are 

continuous.  The  claim  now  follows  from  the  fact  that  any  continuous  image  of  a  compact  set  is 
compact.  □ 

The  final  step  is  to  show  convergence  of  the  algorithm  (3.5)  to  the  set  of  local  minima.  The 
o.d.e.  technique  is  commonly  used  to  prove  convergence  of  stochastic  approximation  algorithms. 
Here,  we  show  that  the  algorithm  (3.5)  asymptotically  converges  to  the  stable  points  of  the  o.d.e. 
(4.7).  For  any  function  H  :  1ZN  — >  1Z,  let  VH(x)  =  [Vi H(x), . . . ,  V a rH(x)]T  represent  the  gradient 
of  H  at  the  point  x  G  1Z N .  Let  Z(t)  =  (Zi(t), . . .  ,  Z^it))  G  1Z N ,  with  Zi(t),  i  =  1, . . . ,  N,  satisfying 
the  o.d.e. 

Zi  (t)  =  7 Ti(-ViJ(Z(t))),  t  >  0,  Z( 0)  G  C,  (4.7) 

where  for  any  bounded,  continuous,  real  valued  function  v(.), 


Ki(v{y)) 


lim 

0<Tj— >o 


(  n(y  +  yv{y)) 
V  v 


For  x  =  (xi, . . .  ,xn)T,  let  n(x)  =  (#i(xi), . . . ,  ttn{xn))T ■  The  operator  % f(.)  forces  the  o.d.e.  (4.7) 
to  evolve  within  the  constraint  set  C.  Let  K  =  {0  G  C  \  i f(VJ(0))  =  0}. 

We  recall  here  a  key  result  from  [18]  stated  as  Lemma  4.1  below.  Consider  an  o.d.e.  in  1ZN 


x  {t)  =  F(x(t)), 


(4.8) 


which  has  an  asymptotically  stable  attracting  set  G.  Let  Ge  denote  the  e-neighborhood  of  G  viz., 
Ge  =  {x  |  3x'  G  G  s.t.  ||  x  —  x'  || <  e},  where  ||  •  ||  represents  the  sup  norm.  For  T  >  0,  7  >  0,  say 
that  y(  )  is  a  (T,  7)-perturbation  of  (4.8)  if  there  exist  real  numbers  0  =  To  <  T\  <  T2  <  ■  ■  •,  such 
that  Tl+\  — Ti>T ,  Vi,  and  on  each  interval  [Tt,  T)+i],  there  exists  a  solution  xl(-)  of  (4.8)  such  that 


sup 

£Epi,Ti_ 1-1  ] 


xl{t)-y{t) 


<  7- 
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The  following  result  is  adapted  from  [18],  pp.339.  The  proof  of  this  can  be  found  in  the  appendix 
of  [10]. 

Lemma  4.1  For  given  e  >  0,  T  >  0,  there  exists  a  7  such  that  for  all  7  £  [0,7],  any  (T,  7)- 
perturbation  of  (4.8)  converges  to  G€.  □ 

For  fixed  77  >  0,  let  K'l  =  {9  €  C  |  36'  £  K  s.t.  \  \9  -  0'||  <  rj }  represent  the  set  of  points  within 
a  distance  77  of  the  set  K.  As  a  direct  consequence  of  Lemma  4.1,  for  any  given  77,  T  >  0,  Ely  >  0 
s.t.  V7  €  [0,7],  any  (T,  7)-perturbation  of  (4.7)  shall  converge  to  Kv.  Finally,  Theorem  4.2  shows 
that  given  77  >  0,  there  exists  a  8  >  0  such  that  the  algorithm  SPSA-1  for  all  8  <  8,  converges  to 
Kv  a.s.  We  only  sketch  the  proof  of  this  theorem  here  since  the  details  are  similar  to  that  in  [6]. 

Theorem  4.2  Given  77  >  0,  >  0  such  that  for  any  <5  £  (0,  <5],  the  algorithm  (3.5)  converges 

to  Kn  almost  surely. 

Proof  (sketch)  Note  that  the  algorithm  (3.5)  can  be  written  as: 


0i(m  +  1)  =  7 Ti 


Now  from  the  fact  that  Sj=nrL+l  a(j)/b(m)  — >  1  as  m  — >  00  and  using  the  conclusions  of  Theorem 
4.1,  the  algorithm  can  be  shown  to  asymptotically  exhibit  the  same  behaviour  as  the  following 
algorithm: 


9i(m  +  1)  =  7Tj  9i(m)  +  b(m) 


J(9{m)  —  8A(m))  —  J(9(m)  +  8A(m)) 


28Amti  )  )  ' 

Now  construct  martingale  sequences  {Nt(p)}.  i  =  1 , ,N,  as  follows:  For  i  =  1 , ,N, 


(4.9) 


N*(p)  = 


j= 0 


J(9{j)  —  8A(j))  —  J{9{j)  +  8A(j)) 
28Ajti 


-E 


J(9(j)-8A(j))-J(9(j)+8A(j)) 


28A 


T’ . 

•r  3 


■3,1 


where,  for  j  >  1,  T’ j  =  cr(9( 0),  0(1),  . . .,  9(j),  A(0),  A(l),  . . .,  A(j  —  1))  represents  the  filtration 
and  the  expectation  E[-\  is  w.r.t.  the  common  expectation  of  Aj_t.  Note  that  A  (j)  is  independent 
of  E'y  It  can  now  be  easily  shown  (see  [6])  that  sequences  {Ni(p)},  i  =  1,  ...,1V,  converge  a.s., 
as  p  — >  00.  Hence,  one  could  replace  algorithm  (4.9)  by  the  following  equivalent  algorithm:  For 
z  =  1, ...  ,7V, 


9i(m  +  1)  =  7 Ti 


^ '  9  i{m )  +  b(m)E 


J(9(m)  —  8A(m))  —  J(9{m )  +  8A(m)) 
28Am:i 


Finally,  it  can  be  shown  that  as  8  — >  0, 


(4.10) 


J{9{j)  —  8A(j))  -  J(9(j)  +5A(j)) 

28Aji 


ViJ(0(j)) 


->  0. 
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Thus  (4.10)  can  be  shown  to  asymptotically  exhibit  analogous  behaviour  as  the  following  algorithm: 
For  i  = 

9i{m  +  1)  =  7 Ti  ( 6i(m )  +  b(m)  (-Vi  J(0(m)))) . 

The  last  finally  can  be  shown  to  asymptotically  track  the  stable  points  of  the  o.d.e.  (4.7).  □ 

Remark  Note  that  K  is  the  set  of  all  critical  points  of  (4.7),  and  not  just  the  set  of  local 
minima.  However,  points  in  K  that  are  not  local  minima  will  be  unstable  equilibria  and  since  our 
algorithm  is  of  the  gradient  descent  type,  it  will  converge  a.s.  to  the  ^-neighborhood  of  Kq(=  the 
set  of  local  minima  of  </(.))  C  K. 


5  Convergence  Analysis  of  SPSA-2 


Consider  now  the  stochastic  approximation  scheme  SPSA-2.  We  assume  (Al),  (A2')  and  (A3)  in  this 
section.  Recall  that  {(X~(l),  Y~(l))}  and  {( X+(l ),  T+(0)}  are  the  two  parallel  simulations  respec¬ 
tively  governed  by  {0(1)  —  8A(l)}  and  {0(1)  +  8A(l)},  where  9(1)  =  9 


and  where 


l 


represents  integral  part  of  — .  In  other  words,  if  l  has  the  form  l  =  nL 

1j 


m  G  {0, 1, . . .  ,  L  —  1}  and  n  is  an  integer,  then 


and  A  (l)  =  A 

m,  where 

=  n.  Note  that  by  definition,  {9(1)}  (resp. 


(A(Z)})  takes  values  in  the  compact  set  C  (resp.  E).  Define  V(TZq  x1Zd  xC  x  77)-valued  processes 

{/vl  and  04}  by 


71—1 

H~(A1  xA2xBxD)  =  -Yj  I{X~(m)  <5  A±,  Y~(m)  €  A2,0(m)  G  B,  A (m)  €  D} 

n  m=0 

and 

-t  n—1 

/4(Ai  xA2xBxD)  =  ~Yj  I{X+(m )  G  Alt  Y+(m)  G  A2, 9(m)  G  B,  A  (m)  G  D}, 

Tt 

m= 0 


for  Borel  sets  A±  C  7Zq,  A2  C  74,  B  C  C  and  D  C  E.  Let  7 Zq+d  =  77.9+<iU{oo}  denote  the  one  point 
compactification  of  1Zq+d.  The  following  theorem  establishes  tightness  of  sequences  {/U“}  and  {/4 } 
in  V(TZq  x  lZd  x  C  x  E).  The  proof  follows  in  a  somewhat  similar  manner  as  the  proof  of  Theorem 
4.1.  Here  one  first  shows  under  the  martingale  stability  theorem  of  [25],  pp.53  and  Assumption 
(A2')  that  {fi^}  and  {/4I  are  tight  in  V(lZq+d  x  C  x  E)  which  is  a  compact  Polish  space.  Finally, 
using  the  Liapunov  function  V(., .)  in  Assumption  (A3)  and  again  the  martingale  stability  theorem 
of  [25],  pp.53,  one  shows  that  {/U“}  and  {/4}  are  iR  fa°t  tight  in  V(1Zq  x  lZd  x  C  x  E)  itself.  We 
do  not  present  the  proof  of  Theorem  5.1  here  so  as  to  avoid  repetition. 

Theorem  5.1  Almost  surely,  {/7“}  and  {/j+ }  are  tight  sequences  in  V(lZq  x  lZd  x  C  x  E).  □ 

We  now  proceed  with  the  rest  of  the  analysis.  Let  for  k  >  1,  E'k  =  a(9(0),  9(1),  . . 9(k),  A(0), 
A(l),  . . .,  A (k  —  1)).  Then  A (k)  is  independent  of  E'k,  Vfc  >  1.  Dehne  sequences  {N~(p),p  >  1}, 
(TVA (p) , p  >  1};  i  =  1, . . . ,  TV,  as  follows: 


Ni  (P)  =  a(j) 
7=o 


J(0(j)-SA(j)) 


A 


-E 


J(9(j)  -  SA(j)) 


A 


E'i 


■3,1 
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and 


N?(p)  = 

j=o 


J(9(j)  +  5A(j)) 


A 


—  E 


‘3,* 


J(9(j)  +  6A(j)) 


A 


‘3,t 


Then,  we  have 

Lemma  5.1  For  every  i  =  1, . . .  ,N,  {N~(p)}  and  {iV+ (p)}  converge  a.s. 

Proof  Follows  in  a  similar  manner  as  Lemma  A. 2  of  [6].  □ 

71—1 

Now  let  s(0)  =  0,  s(n)  =  ^  a(z),  n  >  0.  Let  Atjj  =  A n^,  for  t  €  [s(n),  s(n+l)],  n  >  1.  Further, 

i=0 

let  A (t)  =  (Atii, . . . ,  Atijv)T-  Recall  that  for  any  bounded,  continuous,  real  valued  function  v(-), 


Ki{v{y)) 


lim 

0<Tj— >o 


/  My  +  yv{y)) 

v  9 


Also,  for  x  =  (xi, . . .  ,xn)t  £  7 ZN ,  i f(x)  =  (7fi(x), . . .  ,tcn(x))t .  Consider  the  following  o.d.e.:  For 
*  =  1, ...  ,7V, 

9i  (t)  =  7 fj  f-F 

where,  the  operator  F[-]  in  (5.1)  represents  the  expectation  w.r.t.  the  common  c.d.f.  of  {A^j}. 
Recall  that  L  represents  the  number  of  data  aggregation  epochs  in  between  two  successive  parameter 
updates  in  SPSA-2.  Let  c(i )  =  b(i)L.  It  is  easy  to  see  that 


J(6(t)  -  8A(t))  -  J(9(t)  +  6A(t)) 


25  A 


t,i 


(5.1) 


^c(i)  =  oo,  ^c(i)2<oo,  a(i)  =  o{c(i)). 

i  i 

Here,  we  shall  consider  {a(i)}  and  (c(z)}  to  be  the  two  step-size  sequences  (as  opposed  to  (a(i)} 
and  (6(z)}  in  SPSA-1).  Let  t(0)  =  0,  t(n)  =  Si!=olc(*)>  n  >  1.  Let  :  [0,  oo)  — >  1Z 

and  9{-)  :  [0,  oo)  — >  C  denote  the  continuous  functions  obtained  by  setting  z~(t(n ))  =  Z£(n), 
z+(t(n))  =  Z^(n),  9(t(n ))  =  9(n)  respectively  Vn,  with  linear  interpolation  on  [t(n),t(n  +  1)], 
n  >  0.  Consider  the  system  of  o.d.e. ’s 

9  ( t )  =  0, 

i-  (■ t )  =  J(9(t)  —  8A(t))  —  z~(t), 

z+  ( t )  =  J(9(t)  +  5A(t ))  —  z+{t).  (5-2) 

We  now  have 

Theorem  5.2  For  any  T,  <5  >  0,  ( z~(t(n )  +  . ),z+(t(n )  +  .),9(t(n)  +  .))  is  a  bounded  (T,S)- 
perturbation  of  (5.2)  for  n  sufficiently  large. 

Proof  Note  that  the  algorithm  SPSA-2  (cf.  (3.7))  can  be  rewritten  as  follows:  For  i  =  1, . . . ,  N, 
9i(n  +  1)  =  7 T*  (oi{n)  +  a(n) 

1  L-x 

ZL  {n  +  l)  =  Zj L  (n)  +  c(n)-  ^  (h(Ym  (n  +  1))  -  Zm(n  +  1)), 
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(5.3) 


1  L~l 

Z+(n  +  l)  =  Z+(n)  +  c(n)-  X  (h(Ym  (n  +  1))  “  Zm(n  +  1)). 


From  {Zjn(ri)}  and  {Z+(n)},  n  >  0,  m  €  {0, 1, . . .  ,  L  —  1}  defined  in  (3.7),  we  obtain  sequences 
and  {Z+(l)}  as  follows:  First  write  l  as  l  =  nL+m  for  some  n  >  0  and  m  £  {0, 1, ... ,  L— 1}. 
Then  set  Z~(l )  =  Z~(nL  +  rn)  =  Z~(n).  Similarly  set  Z+(l )  =  Z+(nL  +  rri)  =  Z+(n).  Now  from 
the  second  and  third  equations  in  (3.7),  note  that  Z~(l  +  1)  (resp.  Z+{1  +  1))  is  the  convex 
combination  of  h(.)  and  Z~(l )  (resp.  Z+{1)).  Thus  {Z~(l)}  (resp.  {Z+(l)})  are  uniformly 
bounded  sequences  with  upper  bound  depending  on  h  and  j|  Z~( 0)  ||  (resp.  ||  Z+{ 0)  ||).  Now 
{Z^(n)}  (resp.  {Z^{n)})  is  just  a  subsequence  of  {Z~(l)}  (resp.  {Z+(l)})  and  hence  is  uniformly 
bounded  (irrespective  of  the  value  of  L )  as  well.  Before  we  proceed  further,  let  us  look  at  the  term 


1 

I 


L- 1 


X  Zm{n  +  1) 

m= 0 


on  the  RHS  of  the  second  equation  in  (5.3).  We  will  show  that  it  has  the  same 


1 


L—l 


asymptotic  behaviour  as  ZL  (n).  A  similar  argument  shall  hold  for  the  term  —  X  Z m(n  +  1)  on  the 

L  m=0 


RHS  of  the  third  equation  in  (5.3).  First  note  that  the  terms  Z^(n+  1),  Z±  (n  +  l),...,Z£_1(n  +  1) 
are  all  governed  by  the  same  parameter  update  viz.,  (9(n  +  1)  —  8A(n  +  1)).  Now  recall  that  for 
m  £  {0, 1, . . . ,  L  —  1},  Z~(n  +  1)  =  Z~((n  +  1  )L  +  m).  For  notational  simplicity  let  ( n  +  1  )L  =  k 
in  the  rest  of  the  proof.  Now  from  the  second  equation  in  (3.7), 


Z~(k  +  !)  =  (!  —  b(n))Z~(k)  +  b(n)h(Y~(k)) 


Similarly, 

Z~(k  +  2)  =  (1  -  b{n))Z~(k  +  1)  +  b{n)h(Y~(k  +  1)) 

=  (1  -  b{n))2Z~{k )  +  (1  -  b(n))b(n)h(Y~ (k))  +  b{n)h(Y~(k  +  1)) 

Proceeding  in  this  manner,  one  obtains 

Z~(k+L- 1)  =  (l—b(n))L~1Z~(k)+(l—b(n))L~2b(n)h(Y~(k))+(l—b(n))L~3b(n)h(Y~(k+l))+-  •  • 
+(1  —  b(n))b(n)h(Y~  (k  +  L  —  3))  +  b{n)h(Y~  (k  +  L  —  2)) 

Now, 

m— 0 

+(b(n)  +  (1  —  b(ri))b(n )  +  •  •  •  +  (1  —  b{n))L~2b{n))h{Y~  (k)) 

+{b(n)  +  (1  —  b{n))b{n)  +  •  •  •  +  (1  —  b{n))L^3b{n))h{Y~  (k  +  1)) 

+  ---  +  b(n)h(Y~(k  +  L-  2))] 

=  y[(1  ~  (1,7  \{n))L)Z~(k)  +  (1  -  (1  -  6(n))L-1)h(T-(fc))  +  (1  -  (1  -  b{n))L-2)h(Y~ (k  +  1))+ 
L  byn) 

•  •  •  +  b(n)h(Y~  (k  +  L  —  2))]  (5.4) 

Now  applying  standard  martingale  arguments  (as  in  Lemma  4.3  of  [5])  in  the  second  (resp.  third) 
equation  in  (3.7),  the  algorithm  (3.7)  can  be  shown  to  behave  asymptotically  (as  n,  k  — >  oo)  in  the 
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same  manner  as  an  analogous  algorithm  that  has  J(9(n  +  1)  —  8A(n  +  1))  (resp.  J{9(n  +  1)  + 
8A(n  +  1)))  in  place  of  the  terms  h(Y~(k)),  h(Y~(k  +  1)),. . .  ,h(Y~(k  +  L  —  2))  (resp.  h(Y+(k)), 
h{Y+(k  +  1)),. . .  ,h(Y+(k  +  L  —  2))).  Thus  one  can  in  particular  consider  the  latter  algorithm  in 
place  of  (3.7).  Upon  simplification  now  (with  a  slight  abuse  of  notation),  the  RHS  of  (5.4)  can  be 
replaced  by 

=  Ik1  ~  (1t(~  )(n))Vu*0  +  +  1)  -  SA(n  +  D) 


+  <1  y  lj(8("  +  11  -  iA<"  +  !))]• 


Now, 


1  L_1  1  L  V  1  _  (I  _  h(n))L 

£Z-(k  +  m)~  Z-(k)  ||<||  -  £  Z~(k  +  m) - TJ  -  Z-(k) 


L 


m= 0 


L 


m= 0 


Lb(n ) 


+ 


1  —  (1  —  b{n)Y 

Lb(n) 


1  Z~(k) 


<||  LJ(9(n  +  1)  -  SA(n  +  1))  +  (1  1  J(9(n  +  1)  -  SA(n  +  1)) 

byn) 

+  ii  (1-(1..:6(">)L-i)z-w  ii. 


Consider  a  real  valued  function  f(x)  = 


Lb(n) 

1  —  (1  —  x) 


| .  By  L’Hospital’s  rule,  it  is  easy  to  see  that 


/ 1  —  M  —  b(  u))^\ 

lim  fix)  =  L.  It  is  thus  clear  that  lim  - — — -  =  L.  Hence, 

|x|->0  b(n)-¥ 0  \  b(n)  I 


L—l 


L 


Z  (k  +  m)  —  Z  ( k )  || — »  0,  as  k  — >  oo. 


m= 0 


Finally,  the  first  equation  in  (5.3)  can  be  rewritten  as 


9i(n  +  1)  =  7 Tj  9i(n)  +  c(n) 


aln 


c(n ) 


ZL(n)  -  Zl(n) 

28A 


(5.5) 


Moreover,  since  a(n )  =  o(c(n)),  applying  standard  arguments  [10]  to  (5.5)  and  the  second  and  third 
equations  in  (5.3),  one  obtains  the  claim.  □ 

Define  £“(•),  i+(-)  :  [0,  oo)  — >  1Z  and  §(■)  :  [0,  oo)  — >  C  by  z~(s{n ))  =  Z£  (n),  z+(s(n))  =  Z~£ (n), 
9(s(n))  =  9(n)  respectively  Vn,  with  linear  interpolation  on  intervals  [s(n),  s(n  +  1)],  n  >  0. 

Lemma  5.2  For  any  T,  8  >  0,  6(s(n)  +  .)  is  a  bounded  (T,  (5)-perturbation  of  (5.1)  for  sufficiently 
large  n. 

Proof  Rewrite  the  first  equation  in  (3.7)  as  follows:  For  i  =  1, . . . ,  TV, 


9i(m  +  1)  =  7T  i(9i(m)  +  a(m)E 


J{9(m)  —  8A(m))  —  J(9(m)  +  8A(m)) 

28Amj, 


+r]i  (m)  +  m  (m)), 


(5.6) 
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where, 


771(777)  =  a(m)( 


J{6(m)  —  5A{m))  —  J(9(m)  +  5A(m)) 


25  A 


m,i 


-E 


J(9(m)  —  5A(m))  —  J(9(m)  +  5A(m)) 


T'r 


and 


772(777)  =  0(777) 


2(5  A 

Z£(m)  —  Z£  (m)  J(9{m)  —  5A(m ))  —  J(9(m)  +  5A(m )) 


25  Ar 


25  Ar 


Now,  r/i  (to)  becomes  asymptotically  negligible  by  Lemma  5.1  and  772(777,)  vanishes  asymptotically  as 
a  consequence  of  Lemma  4.1  applied  to  Theorem  5.2.  The  algorithm  (3.7)  can  then  be  viewed  as  a 
discretization  of  the  o.d.e.  (5.1)  except  that  as  mentioned  earlier,  it  has  in  addition  asymptotically 
diminishing  error  terms  771(777)  and  772(777.).  Now  a  standard  argument  as  in  pp.  191-194  of  [20], 
proves  the  claim.  □ 

Lemma  5.3  For  any  9{m)  €  C,  for  all  i  =  1, . . . ,  IV, 


lim 

54.0 


E 


J{9{m)  —  5A(m))  —  J(9(m)  +  5A(m)) 
25Am; 


T'r 


-  ViJ(9(m)) 


=  0. 


Proof  Follows  in  a  similar  manner  as  Lemma  A. 5  of  [6].  □ 

Recall  that  the  set  K  =  {6  6  C  \  7 r(VJ(0))  =  0}  is  the  asymptotically  stable  attractor  set  for 
the  o.d.e.  (4.7)  with  «/(•)  itself  serving  as  the  strict  Liapunov  function.  Further,  K'1  =  {9  €  C  \  || 
6  —  9q  || <  77,  #0  ^  K}  represents  the  set  of  points  in  C  that  are  within  an  77-distance  from  the  set 
K.  We  now  have 

Lemma  5.4  Given  77  >0,  there  exists  5o  >  0  such  that  for  all  5  €  (0,  5o],  Kv  is  an  asymptotically 
stable  attractor  set  for  the  o.d.e.  (5.1). 

Proof  As  already  mentioned,  </(•)  itself  serves  as  a  strict  Liapunov  function  for  (4.7)  outside 
the  set  K.  Now  by  Lemmas  5.2  and  5.3,  for  sufficiently  small  5,  «/(■)  will  also  serve  as  a  strict 
Liapunov  function  for  (5.1)  outside  the  set  Kv.  □ 

Finally,  we  come  to  the  main  result  of  this  section. 

Theorem  5.3  Given  77  >  0,  there  exists  5o  >  0  such  that  for  all  5  €  (0,  5o],  9(n)  — >  Kv  a.s. 

Proof  Follows  from  Lemmas  4.1,  5.2  and  5.4.  □ 

This  completes  the  convergence  analysis  of  both  the  algorithms  SPSA-1  and  SPSA-2. 


6  Numerical  Results 

In  this  section,  we  demonstrate  our  algorithms  SPSA-1  and  SPSA-2  by  means  of  a  simple  queueing 
system  and  numerically  compare  their  performance  with  the  algorithms  in  [4]  and  [5],  ( N  +  1)- 
Simulation  FDSA-1,  Two-Simulation  FDSA-1  and  ( N  +  l)-Simulation  FDSA-2.  We  consider  the 
two-node  queueing  network  shown  in  Figure  1  below. 

There  are  two  external  arrival  streams  (one  each)  to  the  two  nodes.  Arrivals  to  the  nodes  from 
these  streams  follow  independent  Poisson  processes  with  rates  Ai  and  A2.  The  service  times  are 
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exponentially  distributed  with  rates  pi(8\)  and  p 2(82),  respectively,  where  6\  and  82  are  parameter 
vectors  at  the  two  nodes.  The  exact  dependence  of  p\  and  p2  on  8\  and  #2,  respectively,  is  given 
below.  A  customer  after  service  at  Node  1  joins  the  queue  at  Node  2.  After  service  at  Node  2,  a 
customer  either  departs  with  probability  p  or  is  fed  back  to  Node  1  with  the  remainder  probability 
of  1  —  p.  Let  Wj(9i),  i  =  1,  2,  j  =  1,  2, . . . ,  represent  the  waiting  time  of  the  jth  customer  at  the 
ith  node  when  the  parameter  value  for  the  ith  node  is  0*.  Our  aim  is  to  find  the  optimum  (joint) 
parameter  vector  (81,82)  within  the  constraint  set  C  which  minimizes  the  sum  of  the  stationary 
mean  waiting  times  in  the  two  queues.  The  constraint  set  C  is  defined  as  follows:  Given  M  >  0, 
each  component  8?,  i  =  1, 2,  j  =  1, . . . ,  M,  takes  values  in  the  interval  8j  min ,  8j  max  and  so  the  set 
C  is  defined  as 


1 

l,min’ 


n 1 

''l,max 


X  .  .  .  X 


M 

l,min’ 


^l,max 


X  ^2, min 


X  .  .  .  X 


M 

2, min’ 


We  assume  that  both  8\  and  82  are  vectors  of  the  same  dimension  M.  Note  that  M  =  N/2.  Thus, 
8i  =  (8j, . . . ,  8^)T ,  i  =  1,2,  and  the  whole  parameter  vector  is  represented  as  8  =  (&\, . . . ,  0jv/, 
8\,. .  ,,02^)T-  Let  8  =  (8\, . . . ,  Sf1, 8\,  ■  ■  ■ ,  8 2I)T  represent  the  target  parameter  (we  will  explain  this 
in  a  moment).  The  dependence  of  the  service  times  on  the  parameters  has  the  following  form: 


f^i(8i)  — 


i+njli  4-% 


,  *—1,2, 


where  Jii,  i  =  1,2,  is  assumed  to  be  constant.  Note  that  the  cost,  which  is  the  sum  of  the  stationary 
mean  waiting  times  in  the  two  queues,  will  be  minimized  if  the  service  rates  are  maximized.  The 
latter  clearly  occurs  at  8  =  8.  Thus  we  know  that  the  optimum  for  our  problem  lies  at  8.  Let  Oj  (0), 
i  =  1,  2,  j  =  1, . . . ,  M,  represent  the  initial  (starting)  values  of  the  parameter  components. 

For  the  simulation  experiments,  we  select  the  following  step-size  sequences  ({a(n)}  and  (6(n)}) 
for  all  the  five  schemes:  a( 0)  =  6(0)  =  1,  a(n)  =  l/n,  b(n)  =  1/n2/3,  n  >  1.  Moreover,  for  SPSA-2, 
we  choose  L  =  100  in  the  experiments.  Thus  data  is  aggregated  in  SPSA-2  over  100  instants 
in  between  two  successive  parameter  updates.  For  both  SPSA-1  and  SPSA-2,  we  choose  random 
variables  A n^,  i  =  1, . . . ,  N,  n  >  0,  to  be  i.i.d.,  Bernoulli  distributed  viz.,  A =  ±1  w.p.  1/2, 
i  =  1, . . . ,  IV,  n  >  0.  We  now  present  the  simulation  results.  We  consider  the  following  set  up  for 
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all  the  three  algorithms. 

Ai  =  0.2, 

a2  =  0.1, 

hi  =  87, 
h2  =  92, 
p  =  0.4, 

Cin  =  0  .l,i  =  l,2,j  =  l,...,M, 

ei max  =  0.6>  i  =  l,2 = 

@i  =  0-3,  i  =  1,2,  j  =  1, . . . ,  M, 

^i(O)  =  0.2,  j  =  1, ...  ,M, 

0i(0)  =0.4,  j  = 

We  consider  two  values  of  M  for  our  experiments:  M  =  5  and  M  =  20.  Thus  the  parameter 
vectors  we  consider  in  the  simulations  using  the  two  schemes  have  dimensions  10  and  40  respectively. 
We  consider  a  total  of  3  x  105  data  aggregation  epochs  for  all  the  five  schemes.  The  corresponding 
total  number  of  parameter  updates  for  each  algorithm  (for  M  =  5  and  20  respectively)  is  shown  in 
Table  1  below. 


Table  1:  Number  of  Parameter  Updates  in  3  x  105  epochs 


Algorithm 

No.  of  Parameter 
Updates  for  M  =  5 

No.  of  Parameter 
Updates  for  M  =  20 

(TV  +  l)-FDSA-2 

3  x  HP 

3  x  105 

SPSA-2 

3  x  103 

3  x  103 

SPSA-1 

92 

92 

(TV  +  1)-FDSA-1 

92 

92 

Two-FDSA-1 

2 

9— 

12 

2— 

_ ID _ 

_ 40 _ 

Note  that  whereas  there  are  3  x  105  parameter  updates  in  (TV  +  l)-Simulation  FDSA-2,  the 
corresponding  number  in  SPSA-2  is  3  x  103  (since  L  =  100).  Moreover,  the  corresponding  number 
in  SPSA-1  and  ( TV  +  l)-Simulation  FDSA-1  is  only  92.  This  is  so  because  for  the  step-size  sequences 
considered  in  the  experiments  viz.,  a(0)  =  6(0)  =  1,  a(n)  =  1/n,  6(n)  =  1/n 2/3,  n  >  1,  the  values  77.92 
and  7793  in  {nm,m  >  1}  (defined  after  (3.2))  are  as  follows:  7792  ~  2.94  X  105  and  7793  ~  3.09  x  105 

respectively.  Finally,  in  Two-Simulation  FDSA-1,  the  number  of  parameter  updates  for  M  =  5 

2 

(mentioned  in  Table  1)  is  9—-  It  is  written  in  this  manner  to  indicate  that  in  addition  to  the 

9  times  that  the  whole  parameter  is  updated  in  3  x  105  data  aggregation  epochs,  the  first  two 

components  of  the  parameter  vector  are  also  updated  for  a  10th  time.  Similarly,  for  M  =  20,  the 

12 

number  of  parameter  updates  in  Two-Simulation  FDSA-1  is  2 — . 

In  what  follows,  we  shall  compare  the  performance  of  our  SPSA  algorithms  (SPSA-1  and  SPSA- 
2)  with  the  algorithms  of  [4]  and  [5]  (viz.,  ( N  +  l)-Sinrulation  FDSA-1,  Two-Simulation  FDSA-1 
and  (N  +  l)-Simulation  FDSA-2),  in  terms  of  speed  of  convergence.  We  choose  the  Euclidean 
distance  between  the  current  parameter  update  and  the  target  parameter  value  as  the  performance 
metric  and  plot  that  w.r.t.  the  number  of  data  aggregation  epochs  for  all  the  five  schemes.  The 
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Euclidean  distance  d(0,  8)  is  defined  by 

d(8, 6)  =  (( 8 4  -  8\)2  +  •  •  •  +  (<  -  8f)2  +  (e\  -  elf  +  •  •  •  +  {e?  -  ^ . 

We  performed  five  independent  replications  of  each  experiment  using  different  seeds.  In  Figures 
2  and  3,  the  mean  trajectories  from  these  experiments  are  plotted  for  all  five  schemes,  for  both 
the  10-dimensional  and  the  40-dimensional  parameter  cases  respectively.  The  standard  error  from 
these  replications  for  the  five  schemes  for  both  cases  was  computed  at  the  end  of  these  simulations, 
and  is  indicated  in  Table  2. 

As  expected,  algorithms  SPSA-1  and  SPSA-2  show  significantly  better  performance  than  the 
rest  of  the  algorithms.  Also,  as  expected,  Two- Simulation  FDSA-1  shows  the  worst  performance 
amongst  the  five  algorithms.  It  is  interesting  to  observe  that  for  the  10-dimensional  parameter 
case,  SPSA-1  shows  the  best  performance  amongst  the  five  algorithms.  Also,  for  this  case,  the 
performance  of  (N  +  l)-Simulation  FDSA-2  is  nearly  as  bad  as  that  of  Two-Simulation  FDSA-1. 
However,  for  the  40-dimensional  parameter  case,  it  is  SPSA-2  (and  not  SPSA-1)  that  shows  the 
best  performance.  Moreover,  the  performance  of  (N  +  l)-Simulation  FDSA-2  for  this  case  is  close 
to  that  of  (N  +  l)-Simulation  FDSA-1,  and  is  considerably  better  than  that  of  Two  Simulation 
FDSA-1. 

The  above  seems  to  indicate  that  for  parameter  dimensions  that  are  not  very  high,  SPSA-1 
(resp.  variants  of  SPSA-1  in  [4])  performs  better  than  SPSA-2  (resp.  ( N  +  l)-Simulation  analogue 
of  SPSA-2  in  [5]).  However,  for  cases  where  the  parameter  dimension  is  high,  the  opposite  is  true. 
The  reason  for  this  could  be  that  SPSA-1  and  its  variants  require  that  the  parameter  be  held 
fixed  over  intervals  of  increasing  size.  Also,  higher  dimensional  parameters  would  typically  require 
several  updates  before  convergence  is  achieved. 

We  observed  that  when  L  =  1  (i.e. ,  no  data  aggregation  in  between  successive  parameter 
updates),  SPSA-2  does  not  exhibit  good  performance.  As  already  mentioned  in  Section  3,  this  is 
probably  because  of  the  fact  that  since  SPSA-2  uses  only  two  parallel  simulations,  the  system  is 
unable  to  adapt  to  the  new  parameter  update  before  it  changes  again.  Data  aggregation  over  L 
epochs  (for  L  >  1 )  on  the  other  hand  leads  to  additional  averaging  and  hence  improved  performance. 
The  choice  of  L  is  completely  arbitrary  though.  We  observed,  however,  that  the  performance 
somewhat  degrades  when  L  is  either  too  low  or  too  high. 

We  observed  that  using  SPSA-1,  for  M  =  5  (i.e.,  number  of  dimensions  of  the  parameter  =  10), 
the  Euclidean  distance  between  the  current  update  and  the  optimum  parameter  became  less  than 
0.10  (on  an  average  of  five  replications)  from  its  28th  parameter  update  onwards  (after  only  3183 
data  aggregation  epochs).  Using  SPSA-2,  the  same  is  achieved  from  its  296th  parameter  update 
(after  2.96  x  104  data  aggregation  epochs).  However,  using  SPSA-1  (after  running  the  algorithm 
long  enough),  it  was  observed  that  for  M  =  20  (i.e.,  number  of  dimensions  of  the  parameter  =  40), 
the  same  distance  became  less  than  0.10  after  its  102nd  parameter  update  (after  nearly  4.75  x  105 
data  aggregation  epochs).  The  same  is  achieved  in  SPSA-2  from  its  829th  update  onwards  (or  after 
only  8.29  x  104  data  aggregation  points). 

It  should  be  noted  that  for  the  same  number  of  data  aggregation  epochs,  SPSA-1,  SPSA-2  and 
Two-Simulation  FDSA-1  require  the  least  number  of  simulations.  In  particular,  for  3  x  105  data 
aggregation  epochs,  the  algorithms  SPSA-1,  SPSA-2  and  Two-Simulation  FDSA-1,  each  require 
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2  X  3  x  105  simulations  for  both  the  10-paranreter  and  the  40-paranreter  vectors,  whereas,  (N  ±  1)- 
Simulation  FDSA-1  and  ( N  ±  l)-Simulation  FDSA-2  require  11  x  3  x  105  simulations  (each)  for  the 
10-parameter  and  41  x  3  x  10'5  simulations  for  the  40-parameter  cases,  respectively. 


Figure  2:  10-Parameter  Vector 


Table  2:  Distance  after  3  x  105  Data  Aggregation  Epochs 


Algorithm 

Mean  ±  Standard  Error 
(10-Parameter  Vector) 

Mean  ±  Standard  Error 
(40-Parameter  vector) 

SPSA-1 

SPSA-2 

(N  ±  1)-FDSA-1 
(N  +  l)-FDSA-2 
Two-FDSA-1 

0.011  ±0.001 

0.051  ±0.02 

0.227  ±0.03 

0.315  ±  0.003 

0.317  ±0.04 

0.121  ±0.03 

0.063  ±0.02 

0.492  ±  0.06 

0.507  ±0.05 

0.782  ±0.05 

On  a  Sun  UltralO  Unix  workstation,  for  M  =  5,  it  took  about  2  minutes  using  SPSA-1  for  the 
Euclidean  distance  from  optimum  to  become  less  than  0.05.  SPSA-2  required  about  3-4  minutes 
for  the  same.  On  the  other  hand,  for  M  =  20,  SPSA-1  took  about  15  minutes,  while  SPSA-2 
required  only  about  5  minutes  for  the  same  to  happen.  ( N  ±  l)-Simulation  FDSA-1  and  ( N  ±  1)- 
Simulation  FDSA-2  (along  with  Two-Simulation  FDSA-1)  took  orders  of  magnitude  more  time  than 
SPSA-1  and  SPSA-2.  For  M  =  20,  after  almost  6.5  x  10'  data  aggregation  epochs  and  running 
for  nearly  21  hours,  the  separation  from  optimum  of  ( N  ±  l)-Simulation  FDSA-1  was  about  0.44, 

while  that  of  ( N  ±  l)-Simulation  FDSA-2  was  about  0.42.  As  expected,  Two-Simulation  FDSA-1 
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showed  the  worst  performance.  For  M  =  20,  after  8 —  parameter  updates  (or  nearly  5  x  108 
data  aggregation  epochs)  and  running  for  almost  18  hours,  the  Euclidean  distance  from  optimum 
using  Two-Simulation  FDSA-1  was  still  about  0.70.  Thus,  our  simulation  experiments  confirm  that 
both  algorithms  SPSA-1  and  SPSA-2  presented  here,  perform  orders  of  magnitude  faster  than  the 
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Figure  3:  40-Parameter  Vector 

algorithms  (. TV  +  l)-Simulation  FDSA-1,  Two-Simulation  FDSA-1  and  (TV  +  l)-Simulation  FDSA-2 
of  [4]  and  [5]. 

7  Conclusions 

We  developed  two  simultaneous  perturbation  stochastic  approximation  (SPSA)  algorithms  (SPSA- 
1  and  SPSA-2)  for  simulation  optimization  of  hidden  Markov  models.  Both  of  these  algorithms  use 
only  two  parallel  simulations  (each)  and  are  generalized  variants  of  the  two-tinrescale  stochastic 
approximation  algorithms  of  [4]  and  [5]  ((TV  +  l)-Simulation  FDSA-1,  Two-Simulation  FDSA-1 
and  (TV  +  l)-Simulation  FDSA-2)  respectively.  Whereas  SPSA-1  updates  the  parameter  over  time 
instants  of  increasing  separation  as  in  [4],  SPSA-2  updates  once  after  every  fixed  number  of  instants. 
The  latter  is  a  generalization  of  the  algorithm  in  [5]  that  updates  the  parameter  at  every  instant;  and 
shows  improved  performance.  The  convergence  analysis  for  both  the  algorithms  was  presented.  We 
conducted  numerical  experiments  with  parameters  of  different  dimensions  on  a  two  node  queueing 
network  model  with  feedback  using  both  the  SPSA  algorithms,  the  Two-Simulation  algorithm  of 
[4]  and  its  (TV  +  l)-Simulation  analogue  and  the  (TV  +  l)-Simulation  algorithm  of  [5].  We  found 
that  the  SPSA  algorithms  converge  orders  of  magnitude  faster  than  the  rest. 


24 


References 

[1]  S.  Andradottir.  Simulation  Optimization.  Chapter  9  in  Handbook  of  Simulation,  ed.  J.  Banks, 
Wiley,  1998. 

[2]  J.  D.  Bartusek  and  A.  M.  Makowski.  On  stochastic  approximation  driven  by  sample  averages: 
convergence  results  via  the  ODE  method.  Technical  Report,  Institute  for  Systems  Research, 
University  of  Maryland,  http://www.isr.umd.edu/TechReports/ISR/1994/TR_94-4/,  1994. 

[3]  S.  Bhatnagar.  Multiscale  stochastic  approximation  schemes  with  applications  to  ABR  service 
in  ATM  networks.  Doctoral  dissertation,  Dept,  of  Electrical  Engineering,  Indian  Institute  of 
Science,  Bangalore,  India,  July  1997. 

[4]  S.  Bhatnagar  and  V.  S.  Borkar.  Multiscale  stochastic  approximation  for  parametric  optimiza¬ 
tion  of  hidden  Markov  models.  Probability  in  the  Engineering  and  Informational  Sciences , 
11:509-522,  1997. 

[5]  S.  Bhatnagar  and  V.  S.  Borkar.  A  two  time  scale  stochastic  approximation  scheme  for  simula¬ 
tion  based  parametric  optimization.  Probability  in  the  Engineering  and  Informational  Sciences, 
12:519-531,  1998. 

[6]  S.  Bhatnagar,  M.  C.  Fu,  and  S.  I.  Marcus.  Optimal  multilevel  feedback  policies  for  ABR  flow 
control  using  two  timescale  SPSA.  Technical  Report,  Institute  for  Systems  Research,  University 
of  Maryland,  http://www.isr.umd.edu/TechReports/ISR/1999/TR_99-18/  (shortened  version 
submitted  for  journal  publication),  1999. 

[7]  S.  Bhatnagar,  M.  C.  Fu,  and  S.  I.  Marcus.  Two  timescale  SPSA  algorithms  for  rate-based  ABR 
flow  control.  Chapter  27  in  System  Theory:  Modeling,  Analysis  and  Control,  ed.  T.Djaferis 
and  I. Schick,  Kluwer  Academic,  1999. 

[8]  V.  S.  Borkar.  On  white  noise  representations  in  stochastic  realization  theory.  SIAM  J.  Control 
and  Optim.,  31:1093-1102,  1993. 

[9]  V.  S.  Borkar.  Probability  Theory:  An  Advanced  Course.  Springer  Verlag,  New  York,  1995. 

[10]  Y.  S.  Borkar.  Stochastic  approximation  with  two  time  scales.  Systems  and  Control  Letters, 
29:291-294,  1997. 

[11]  H.  F.  Chen,  T.  E.  Duncan,  and  B.  P.-Duncan.  A  Kiefer- Wolfowitz  algorithm  with  randomized 
differences.  IEEE  Trans.  Autom.  Contr.,  44(3):442-453,  1999. 

[12]  E.  K.  P.  Chong  and  P.  J.  Ramadge.  Optimization  of  queues  using  an  infinitesimal  perturbation 
analysis  -  based  stochastic  algorithm  with  general  update  times.  SIAM  J.  Contr.  and  Optim., 
31(3):698-732,  1993. 

[13]  E.  K.  P.  Chong  and  P.  J.  Ramadge.  Stochastic  optimization  of  regenerative  systems  using 
infinitesimal  perturbation  analysis.  IEEE  Trans,  on  Autom.  Contr.,  39(7):  1400-1410,  1994. 


25 


[14]  R.  J.  Elliott,  L.  Aggoun,  and  J.  B.  Moore.  Hidden  Markov  Models:  Estimation  and  Control. 
Springer- Ver lag,  New  York,  1995. 

[15]  M.  C.  Fu.  Convergence  of  a  stochastic  approximation  algorithm  for  the  GI/G/ 1  queue  using 
infinitesimal  perturbation  analysis.  J.  Optim.  Theo.  Appl.,  65:149-160,  1990. 

[16]  M.  C.  Fu.  Optimization  via  simulation:  a  review.  Annals  of  Oper.  Res.,  53:199-248,  1994. 

[17]  M.  C.  Fu  and  S.  D.  Hill.  Optimization  of  discrete  event  systems  via  simultaneous  perturbation 
stochastic  approximation.  HE  Trans.,  29(3):233-243,  1997. 

[18]  M.  W.  Hirsch.  Convergent  activation  dynamics  in  continuous  time  networks.  Neural  Networks, 
2:331-349,  1987. 

[19]  Y.  C.  Ho  and  X.  R.  Cao.  Perturbation  Analysis  of  Discrete  Event  Dynamical  Systems.  Kluwer, 
Boston,  1991. 

[20]  H.  J.  Kushner  and  D.  S.  Clark.  Stochastic  Approximation  Methods  for  Constrained  and  Un¬ 
constrained  Systems.  Springer  Verlag,  New  York,  1978. 

[21]  H.  J.  Kushner  and  F.  J.  Vazquez-Abad.  Stochastic  approximation  methods  for  systems  over 
an  infinite  horizon.  SIAM  J.  Contr.  and  Optim.,  34(2):712-756,  1996. 

[22]  H.  J.  Kushner  and  G.  G.  Yin.  Stochastic  Approximation  Algorithms  and  Applications.  Springer 
Yerlag,  New  York,  1997. 

[23]  P.  L’Ecuyer,  N.  Giroux,  and  P.  W.  Glynn.  Stochastic  optimization  by  simulation:  Numerical 
experiments  with  the  M/M/1  queue  in  steady-state.  Management  Science,  40(10):1245-1261, 
1994. 

[24]  P.  L’Ecuyer  and  P.  W.  Glynn.  Stochastic  optimization  by  simulation:  Convergence  proofs  for 
the  GI/G/1  queue  in  steady  state.  Management  Science,  40(11):1562-1578,  1994. 

[25]  M.  Loeve.  Probability  Theory,  vol.  2,  fth  ed.  Springer  Verlag,  New  York,  1977. 

[26]  J.  Neveu.  Discrete  Parameter  Martingales.  North  Holland,  Amsterdam,  1975. 

[27]  G.  C.  Pflug.  Optimization  of  Stochastic  Models:  The  Interface  Between  Simulation  and  Opti¬ 
mization.  Kluwer  Academic,  1996. 

[28]  J.  C.  Spall.  Multivariate  stochastic  approximation  using  a  simultaneous  perturbation  gradient 
approximation.  IEEE  Trans.  Autom.  Cont.,  37(3):332-341,  1992. 

[29]  F.  J.  Vazquez-Abad  and  H.  J.  Kushner.  Estimation  of  the  derivative  of  a  stationary  measure 
with  respect  to  a  control  parameter.  J.  Appl.  Prob .,  29:343-352,  1992. 


26 


